rtftohtml is a tool that turns word processor documents (Word documents, for
example) into documents that can be read on the World Wide Web. The format of
these documents is called HyperText Markup Language (HTML).
rtftohtml is able to automatically convert documents stored in RTF (Rich
Text Format)
to HTML. Most word processors in use on UNIX, Macintosh, PC or NeXT systems
can export their documents in RTF format (hint: have a look at the "Save as..."
dialog box of your favorite word processor).
In processing text, rtftohtml chooses HTML markup based on three
characteristics. These are the destination, the paragraph style and the text
style.
The filter has built-in rules for dealing with destinations. For paragraph and
text styles, the rules for translation are contained in a file called html-trn.
By modifying this file, you can train rtftohtml to perform the correct
translations for your documents. The most common change that you will need to
make is to add your own paragraph styles to html-trn.
rtftohtml should produce reasonable HTML output for most documents. Here is
what you can expect:
RTFtoHTML converts a linear RTF-Document (that may contain cross references,
index entries and footnotes) into a fully hypertexted set of HTML-Documents.
This document, that you are reading right now, is an example of what you can
expect from using rtftohtml.
The 3.0 release added the following features to rtftohtml:
automatic generation of an active index
support for a few of Netscape's HTML extensions, such as the <center> tag and background images (see also section Navigation panels)
and of course: the Users Guide
rtftohtml is available for UNIX, Macintosh (Power PC and 68000), DOS, Windows
3.x, Windows 95 and Windows NT systems. The Windows systems are currently
supported by the DOS application which can be run in all of those environments.
This now includes the rtftoweb extensions on all platforms.
In the UNIX and DOS versions, rtftohtml is invoked by a command like this:
rtftohtml [options] file [[options] file2...]
file is the name of the RTF-file to convert. By default the
HTML-output will be written to a file with the same basename, but with the
.rtf-extension replaced with .html.
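The default naming rule can be sketched in a few lines of Python. This is only an illustration of the rule described above, not rtftohtml's own code: the function name default_output_name is made up for this example, and both the case-insensitive match on the extension and the handling of names without an .rtf extension are assumptions.

```python
import os

def default_output_name(rtf_path):
    """Return the default HTML file name for an RTF input file (illustrative)."""
    base, ext = os.path.splitext(rtf_path)
    # Assumption: a trailing ".rtf" (in any case) is replaced by ".html".
    if ext.lower() == ".rtf":
        return base + ".html"
    # Assumption: any other name simply gets ".html" appended.
    return rtf_path + ".html"

print(default_output_name("art.rtf"))
```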
The most common options are (see below for more detailed descriptions of
these):
These should suffice in most cases. When invoking rtftohtml without any
arguments, a list of all available options is printed. Following is a complete
and detailed list of all available command line options,
sorted alphabetically:
To convert a document from RTF (Rich Text Format) to HTML, rtftohtml requires
the contents of the RTF-file to be formatted with a certain set of paragraph
styles.
For example, headings at level 1 must be formatted with the paragraph style
"heading 1" (which is the built-in default for headings anyway; German heading
styles may be called "Überschrift xy", but they appear in the RTF file as
"heading xy", too), lists must be formatted with a paragraph style such as
"numbered list" etc. The reason for this is that rtftohtml needs to know which
paragraph styles it should map to which HTML tags. This mapping between styles
and tags can be customized by editing the file html-trn in rtftohtml's
library directory (see section html-trn
for more), to create a mapping from your own individual paragraph styles to
HTML-tags. Although this is not as complicated as it might seem, I personally
prefer to adjust my Word-documents to use only (or at least mostly) the
paragraph styles recognized by rtftohtml by default. In this chapter I will
stick to this strategy. See section "Unknown Paragraph style"
for a few words on how to customize rtftohtml to correctly interpret your own
paragraph styles.
To determine the HTML-Title for the created HTML-Files (the text between the
<title> and </title> tags), rtftohtml looks for the \title-token inside the
\info-group of the RTF-File. Thus you should give your RTF-Documents a short,
descriptive title in the respective dialog box of your word processor (it
should be called something like "File information").
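To see what the \title-token looks like inside an RTF file, the following Python sketch pulls it out with a regular expression. It is a simplification and not how rtftohtml itself reads the file: real RTF has nested groups and escapes that need a proper parser, and the names TITLE_RE and find_title as well as the sample text are invented for this example.

```python
import re

# Simplified pattern: the first {\title ...} group found after \info.
# Assumption: the title text contains no nested groups or escaped braces.
TITLE_RE = re.compile(r"\{\\info\b.*?\{\\title\s?([^{}]*)\}", re.S)

def find_title(rtf_text):
    """Return the document title from the \\info group, or None if absent."""
    match = TITLE_RE.search(rtf_text)
    return match.group(1).strip() if match else None

sample = r"{\rtf1\ansi{\info{\title My work of art}{\author A. N. Author}}Hello}"
print(find_title(sample))
```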
Another way to specify the document title is via the -T command line
option. For example:
rtftohtml -T "My work of art" art.rtf
Note that this title will also be automatically inserted by rtftohtml into the
first created HTML-File as a level-1-heading. That's why you should usually
delete the very first heading from your RTF-Document (or at least assign a
different paragraph format to that line) and use it as the document title.
Otherwise rtftohtml would interpret the headline of your RTF-Document as a
level 1 heading and split the output at that point.
rtftohtml automatically recognizes and converts bold, italic and underlined
text. If a certain range of text is written using a monospaced font such as
Courier, it also automatically creates monospaced HTML-output for that range.
Which fonts are considered to be monospaced can be configured in the file
html-trn in section .TMatch
("monospace fonts -> tt"). By default the fonts "Courier" and "Courier New"
are expected to be monospaced.
If you get warning messages such as "no output translation for ..." when
running rtftohtml, you can either replace that character with a less exotic one
in your RTF-file or add a translation to the end of rtftohtml's library file
html-map, such as "character translation".
The newline character (created by Shift-Return) will be automatically
converted to the corresponding HTML-tag, as will the unbreakable space
(created by Control-Shift-Space).
Headings must be formatted with a paragraph style like "heading 1", "heading 2"
etc. (resp. "Überschrift 1" etc.) to be automatically recognized by rtftohtml.
rtftohtml uses these styles to determine when it should split the HTML-file.
The heading level at which splitting should take place can be configured by the
command line switch -h level (see section Command line options). If a heading
contains no text (i.e. it is empty) it will be ignored by rtftohtml.
If the -h switch was present when rtftohtml was invoked, a navigation panel
will be inserted at the top and at the bottom of every generated HTML file.
This navigation panel will contain the following elements:
rtftohtml will try to use the language of the RTF-file for labelling the
navigation panel. Currently there is support for English, Spanish, French and
German. However, if you would like a fancier-looking panel, with buttons
etc., you can tell rtftohtml (by writing a simple configuration file) what
HTML-code it should use for the individual panel elements. The creation of such
configuration files is described in detail in section Navigation panels.
rtftohtml knows about the following lists (in braces is the name of the
respective paragraph style it expects such lists to be formatted with):
Nested lists can be created from an RTF document by using a different style for
each level of indentation. The styles "bullet list 1", "numbered list 2", ...
represent
different levels of nesting, with "bullet list 1" being at nesting level 1. The
only rule for use is that no levels of nesting are skipped. For example, a
"numbered list 3" paragraph must not appear immediately after a "Normal"
paragraph. It must follow a paragraph with a nesting level of 2 or higher.
An example sequence of paragraph styles to produce a nested list might look
like this:
numbered list
bullet list 1
bullet list 2
glossary 2
bullet list 1
numbered list 2
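The "no levels skipped" rule can be checked mechanically. The Python sketch below is an illustration under one stated assumption: that the nesting level of a list style is the number at the end of its name, with 0 for names without a number. In rtftohtml the level actually comes from the .PMatch table in html-trn, and the function names here are made up.

```python
import re

def nesting_level(style):
    """Assumed nesting level: trailing number of the style name, else 0."""
    match = re.search(r"(\d+)$", style)
    return int(match.group(1)) if match else 0

def first_skipped_level(styles):
    """Return the index of the first paragraph that skips a level, or None."""
    previous = 0
    for index, style in enumerate(styles):
        level = nesting_level(style)
        # A paragraph at level n must follow one at level n-1 or higher.
        if level > previous + 1:
            return index
        previous = level
    return None

good = ["numbered list", "bullet list 1", "bullet list 2",
        "glossary 2", "bullet list 1", "numbered list 2"]
bad = ["Normal", "numbered list 3"]
print(first_skipped_level(good), first_skipped_level(bad))
```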
rtftohtml is able to automatically convert tables to HTML 3.0 tables.
Graphics are embedded in RTF in either a binary format or an (ASCII) hex dump of that binary. I have never seen a binary format graphic - I don't think that the filter will process binary correctly. It does handle the hex format of graphics, by converting the hex back into binary and writing the binary to a file. The file extension is chosen by looking at the original type of the graphic. The following list shows the file types and their extensions:
In addition, the filter produces a link to the file containing the graphic.
Now, since the above graphic formats are not very portable, the filter assumes
that you will convert these files to something more useful, like GIF. So the
format of the link is:
<a href="basenameN.ext">Click here for a Picture</a>
where
Since most Web browsers only support images in GIF-format, you will have to convert
the generated PICT- and WMF-files to GIF. For PICT there is picttoppm/ppmtogif,
but for WMF? I don't know of any WMF translators for Unix; for DOS there is
wmf2bmp, whose output could then be converted to GIF via the pbmplus-tools.
From what I understand, WMF is not a pixel- but a vector-graphic format, so
maybe it would be easier to translate WMF to Postscript and then let
Ghostscript do the job of converting to GIF. Any volunteers for writing a
wmftops utility?
You can also change the link to an IMG form.
If you specify the -I command line option, all links to graphics will be of the
form:
<IMG src="basenameN.ext">
There is one other special case. If a graphic is encountered when the filter is
in the process of generating a link, the IMG form of the link is used even
without the -I command line option.
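The two link forms can be summarized in a small Python sketch. It only mirrors the formats quoted above. The function name graphic_link, the meaning of N as a running number and the example extension "wmf" are assumptions made for illustration.

```python
def graphic_link(basename, n, ext, use_img=False):
    """Build the link text for an extracted graphic (illustrative only)."""
    target = f"{basename}{n}.{ext}"
    if use_img:
        # Form used with the -I option, or while a link is being generated.
        return f'<IMG src="{target}">'
    # Default form: a plain hypertext link to the graphic file.
    return f'<a href="{target}">Click here for a Picture</a>'

print(graphic_link("art", 1, "wmf"))
print(graphic_link("art", 1, "wmf", use_img=True))
```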
All kinds of cross references can be created from within the RTF-file. The
reference itself must be formatted
with the attributes "double-underline/hidden" and must follow the standard
HTML-conventions, such as "http://www.w3.org" or "file.txt" or "#mark1". The
"hot" text, that is the text that will appear "clickable" in your Web-browser,
immediately follows the reference and must be double-underlined, but not
hidden.
Anchors for internal cross references
(such as "mark1", corresponding to the example above) must be formatted either
with the attributes "hidden/outline" or "hidden/superscript". For example
this
link will bring you to the list of other features.
If you just want to create a reference to a certain heading
resp. section, it is sufficient to simply format the reference with the color
red. The text of the reference must match the beginning n characters of
the heading, so the references "Supplying" and "Supplying a title"
point to the same section.
If an email address such as chris@sunpack.com
is colored red, rtftohtml will automatically produce a cross reference of type
"mailto".
The same works for all other kinds of URLs,
so if the URL ftp://ftp.rrzn.uni-hannover.de/pub/
is colored red, rtftohtml will automatically produce a reference pointing to
that URL.
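The three uses of red text (heading reference, email address, URL) can be pictured with the following Python sketch. How rtftohtml really tells these cases apart is not documented here, so the tests for "://" and "@", the function name href_for_red_text and the anchor names in the example are all assumptions.

```python
def href_for_red_text(text, headings):
    """Guess the link target for red text (illustrative heuristics only).

    headings maps a hypothetical anchor name to the heading text.
    """
    if "://" in text:
        return text                      # treated as a complete URL
    if "@" in text and " " not in text:
        return "mailto:" + text          # treated as an email address
    for anchor, heading in headings.items():
        # The reference must match the beginning of the heading text.
        if heading.startswith(text):
            return "#" + anchor
    return None

headings = {"sec1": "Supplying a title", "sec2": "Headings"}
print(href_for_red_text("Supplying", headings))
```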
If your RTF document contains footnotes or endnotes,
the filter will place the text of the footnote in a separate HTML document. At
the footnote reference mark, the filter will generate a hypertext link to the
text of the footnote. This works with either automatically numbered footnotes[1], or user supplied footnote reference marks[+].
If you insert index entries into your RTF-document and give rtftohtml the
-x option, rtftohtml will generate a hypertext index for the generated
HTML-documents. Note that when using NCSA-Mosaic as your Web browser you should
also tell rtftohtml to insert some text into the generated anchors by using the
command line switch -X text (see section Command line options).
The paragraph style "hr" can be used to produce a horizontal line in the HTML output (this will be translated to the <hr> tag).
If you have text that you do not want to appear in the HTML output, simply
format the text as Hidden and Plain (that is, no underline, outline...).
If you wish to modify the formatting that discards text, you need to change the
entry in html-trn
that specifies "_Discard".
Normally, if your RTF document contained the text "<cite>hello</cite>", the
translator would output this as
"&lt;cite&gt;hello&lt;/cite&gt;". This ensures that the text
would appear in your HTML output exactly as it appeared in the original RTF
document. If, however, you want the <cite></cite> to be interpreted
as HTML markup, you must format the tags using Hidden and Shadow or Hidden and
Strikethrough. The filter will then send the tags through without translation.
It is also possible to use the paragraph style "HTML" to let rtftohtml
interpret a whole paragraph as being literal HTML.
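The difference between translated and literal text comes down to whether characters such as < and > are written as HTML entities. The Python sketch below shows this translation using the standard library; it illustrates the idea only, and the function name escape_for_html is made up.

```python
import html

def escape_for_html(text):
    """Write <, > and & as entities so the text shows up literally in a browser."""
    # quote=False leaves quotation marks untouched.
    return html.escape(text, quote=False)

print(escape_for_html("<cite>hello</cite>"))  # prints &lt;cite&gt;hello&lt;/cite&gt;
```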
When the rtftohtml filter produces HTML markup, it keeps track of the nesting
level of tags to ensure that you don't get something like
<b><cite>hello</b></cite> which would be incorrect
markup. If you embed HTML markup in your document, the filter will NOT be aware
of it. You must ensure that your markup appears correctly nested.
If you wish to modify the formatting for embedded HTML, you need to change the
entry in html-trn
that specifies "_Literal".
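If you embed markup by hand, a quick check of the nesting can save some trouble. The following Python sketch uses a simple stack to test whether tags are closed in the reverse order in which they were opened. It is not part of rtftohtml, it ignores optional end tags such as </li> and </p>, and the names used are invented for this example.

```python
import re

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
VOID_TAGS = {"br", "hr", "img"}  # tags that never have an end tag

def is_well_nested(markup):
    """Return True if every start tag is closed in the right order."""
    stack = []
    for closing, name in TAG_RE.findall(markup):
        name = name.lower()
        if name in VOID_TAGS:
            continue
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack

print(is_well_nested("<b><cite>hello</b></cite>"))
print(is_well_nested("<b><cite>hello</cite></b>"))
```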
rtftohtml understands a few other paragraph styles by default. These are (among others):
When converting existing documents with rtftohtml you often get a lot of warning messages telling you that some paragraph styles are unknown. Now you can either
To add a new paragraph style, simply go to the .PMatch table contained in the
file html-trn and add an entry to the end. Put the name of the paragraph style
(quoted), the nesting level (usually zero) and the name of the .PTag entry that
should be used.
The file html-trn
is needed by rtftohtml to map character and paragraph styles contained in the
RTF-file to corresponding HTML-tags. It must be readable either from
rtftohtml's library directory (as set in the file makefile.rtftoweb)
or from the directory contained in the environment variable
RTFLIBDIR.
In html-trn there are four tables. They are labelled .PTag, .TTag, .TMatch
and .PMatch. These tables begin with the name (in column one) and continue
until the next table starts. All blank lines and lines beginning with a '#' are
discarded. '#' lines are typically used for comments. The tables themselves are
composed of records containing a fixed number of fields which are separated by
commas. The fields are either strings (which should be quoted), integers or
bitmasks.
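The file layout just described is simple enough to read with a few lines of Python. The sketch below is an illustration, not rtftohtml's own reader: it assumes that a line starting with a period in column one begins a table and that quoted strings follow ordinary CSV quoting rules. The function name parse_tables and the sample text are made up.

```python
import csv

def parse_tables(text):
    """Split html-trn style text into {table name: list of records}."""
    tables = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # blank lines and comments are discarded
        if line.startswith("."):
            current = line.strip()        # a table name in column one
            tables[current] = []
            continue
        if current is not None:
            # Fields are separated by commas; strings are quoted.
            tables[current].append(next(csv.reader([line])))
    return tables

sample = '''\
.PMatch
# "Paragraph Style",nesting_level,"PTagName"
"heading 1",0,"h1"

"numbered list 2",2,"ol-d"
.TTag
"b","<b>","</b>"
'''
print(parse_tables(sample))
```

All fields come back as strings in this sketch; converting the integer and bitmask fields is left out to keep it short.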
Each entry in the .PTag table describes an HTML paragraph markup. The format is:
.PTag
#"name","starttag","endtag","col2mark","tabmark","parmark",allowtext,cannest,DeleteCol1,fold,TocStyl
Sample .PTag Entries
"h1","<h1>\n","</h1>\n","\t","\t","<br>\n",0,0,0,1,1
This is a level 1 heading. The "\n" in the start and end-tag fields forces a newline in the HTML markup. Since newlines are ignored in HTML (except in <pre>) its only effect is to make the HTML output more readable. There is no difference between the first tab and any other. They both translate to a tab mark. Paragraph marks generate "<br>" followed by a newline (just for looks). Text markup (like <b>) is not allowed within <h1> text, because we leave that up to the HTML client. No nesting is allowed - (see the discussion on nested styles). No text is deleted. Every paragraph using this markup will also generate a level-1 table of contents entry.
"Normal","<p>","</p>\n","\t","\t","<br>\n",1,0,0,1,0
This is the default for normal text. Regular text in HTML has no required start and end-tags. The "\n" in the end-tag field forces a newline in the HTML markup. Since newlines are ignored in HTML (except in <pre>) its only effect is to make the HTML output more readable. There is no difference between the first tab and any other. They both translate to a tab mark. This markup will generate <p> par1<br>par2</p> for a two paragraph sequence. If you want more spacing between paragraphs, add a second <BR>.
"ul","<ul>\n<li>","</ul>","\t","\t","\n<li>",1,1,0,1,0
This is the entry for unordered lists. This generates a "<ul>\n<li>" at the start of the list and "</ul>" at the end. There is no difference between the first tab and any other. They both translate to a tab mark. Paragraph marks generate "<li>" preceded by a newline (just for looks). Text markup (like <b>) is allowed, and this entry may be nested - and it allows others to be nested within it. This allows nested lists. No text is deleted.
"ul-d","<ul>\n<li>","</ul>","\t","\t","\n<li>",1,1,1,1,0
This entry is identical to the previous except that the DeleteCol1 field is set to 1. This is used to remove bullets (which really appear in the RTF) because we don't want to see them in the HTML.
Each entry in the .TTag table describes an HTML text markup. The format is:
.TTag
"name","starttag","endtag"
Note that unlike the .PTag table, no text markup should appear more than once. (Of course there is no good reason that it should appear.) If you have two entries with <b></b> start and end tags, it would be possible to get HTML of the form <b><b> text</b></b>. I don't know if this is invalid markup, but it sure is ugly.
Each entry in the .TMatch table describes processing for text styles. The format is:
.TMatch
"Font",FontSize,Match,Mask,"TextStyleName"
The order of bits in the Match and Mask bit-maps is:
# v^bDWUHACSOTIB - Bold
# v^bDWUHACSOTI - Italic
# v^bDWUHACSOT - StrikeThrough
# v^bDWUHACSO - Outline
# v^bDWUHACS - Shadow
# v^bDWUHAC - SmallCaps
# v^bDWUHA - AllCaps
# v^bDWUH - Hidden
# v^bDWU - Underline
# v^bDW - Word Underline
# v^bD - Dotted Underline
# v^b - Double Underline
# v^ - SuperScript
# v - SubScript
Sample .TMatch Entries
# double-underline/not hidden -> hot text
# double-underline/hidden -> href
# v^bDWUHACSOTIB,v^bDWUHACSOTIB
"",0,00100000000000,00100010000000,"_Hot"
"",0,00100010000000,00100010000000,"_HRef"
The first entry will match any text formatted with double underline EXCEPT if it is hidden text. This is accomplished by using those two bits to compare (the MASK field) and having a 1 in the double underline bit and a zero for the hidden text bit. The second entry will match any text formatted with BOTH double underline and hidden text. Any text that matches the first will be treated as the hot text of a link. Any text that matches the second will be taken as the href itself. (The filter requires that the HRef text immediately precede the Hot text.)
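The Match/Mask comparison can be written as a single bit operation: an entry applies when the text attributes, restricted to the bits set in Mask, equal Match. The Python sketch below works through the two sample entries. It illustrates the principle only; the helper names to_bits and matches are invented, and the font and font size fields are left out.

```python
BITS = "v^bDWUHACSOTIB"  # bit order from left to right; B (Bold) is the last bit

def to_bits(flags):
    """Turn attribute letters such as "bH" into an integer bit-map."""
    return sum(1 << (len(BITS) - 1 - BITS.index(flag)) for flag in flags)

def matches(attrs, match, mask):
    """True if attrs agrees with match on every bit selected by mask."""
    return (attrs & int(mask, 2)) == int(match, 2)

HOT = ("00100000000000", "00100010000000")    # Match, Mask of the "_Hot" entry
HREF = ("00100010000000", "00100010000000")   # Match, Mask of the "_HRef" entry

double_underline = to_bits("b")
double_underline_hidden = to_bits("bH")
print(matches(double_underline, *HOT), matches(double_underline_hidden, *HOT))
```

Bits that are zero in Mask are ignored, so bold double-underlined text still matches the "_Hot" entry.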
# Regular matches - You can have multiple of these active
# monospace fonts -> tt
"Courier",0,00000000000000,00000000000000,"tt"
This will match any text that uses the Courier font and mark it using the HTML text markup appearing in the .TTag table with the entry name "tt".
# bold -> bold
# v^bDWUHACSOTIB,v^bDWUHACSOTIB
"",0,00000000000001,00000000000001,"b"
This will match any text that has bold attributes and will mark it using the HTML text markup appearing in the .TTag table with the entry name "b". Note that bold text using the Courier font would match both this entry and the previous. This will yield markup of the form <b><tt>hi</tt></b>. Note that "b" is the name of an entry in the .TTag table, not the HTML markup that is used!
Each entry in the .PMatch table correlates a paragraph style
name to some entry in the .PTag table. The format is:
.PMatch
"Paragraph Style",nesting_level,"PTagName"
Sample .PMatch Entries
"heading 1",0,"h1"
This is a level 1 heading. Any paragraphs with this paragraph style will be mapped to the entry in the .PTag table named "h1".
"numbered list",0,"ol-d"
This is used for numbered lists. Any paragraphs with this paragraph style will be mapped to the entry in the .PTag table named "ol-d".
"numbered list 2",2,"ol-d"
This is an entry for a nested paragraph style. The nesting level of two is used to indicate that this paragraph should appear in the HTML nested within two levels of paragraph markups. The paragraph marked with this style may only appear after a paragraph style that has a nesting level of 1 or greater.
The .Strings table sets values to be used by the translator. The
format is "name","value"
The quotes may be omitted for numbers, so the two lines:
"TheValue","-1"
"TheValue",-1
are identical. Any name can be defined by the Strings table, but only those
outlined below will have any effect on translations.
String | Meaning
SkipNavPanel | If set to 1, the navigation panel usually produced at the top and bottom of pages will be omitted. This is primarily useful when you are not splitting documents, or producing an index.
SkipLeadingToc | At the start of each HTML file, you will normally find a list of all of the headings contained within that file. By setting SkipLeadingToc to 1, this list will be omitted.
SkipTrailingToc | Same as above only for trailing lists.
SkipDimsOnIMGs | For IMG tags, the filter will generate HEIGHT/WIDTH attributes. Setting this string to 1 will omit these attributes.
WMFAdjust | In producing HEIGHT/WIDTH attributes for WMF graphics, a scaling factor of 133% is normally applied (by experimentation this scaling appears to be necessary.) If you find that your graphics are not appearing the same size as in your documents, you may modify this scaling factor. (133 is the default, meaning 133%)
DefaultTitle | If no title is found in the document, and no title is supplied on the command line, this string is used for a title value.
AllBorder | If this is set to 1, all tables will have the BORDER attribute. Normally, only those that have a border on the first cell will have borders.
ExtTarget | Defines a tag that will be included in the <A href=xxx > tag, for external hrefs. If set to 'TARGET="_top"', clicking on that hyperlink in a framed document will replace the entire window, not just the current frame.
FNSep | This tag is inserted before Footnotes at the end of the page
PreferrEmbed | If set, choose embedded graphic over linked
ShortFileDigits | This is the number of digits to use in Short file names (-s option.) If set to 2 (the default) it will allow splitting into 99 files, 3 would support 999 files.
DoAllPars | By default, empty paragraphs are deleted. Setting to 1 will generate all paragraphs (note that empty paragraphs in a list will generate bullet or sequence numbers without any following text.)
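The relation between ShortFileDigits and the number of files is plain arithmetic: with d digits the highest file number is 10 to the power d, minus 1. A short Python check of the two figures given in the table (the function name is made up):

```python
def max_split_files(short_file_digits=2):
    """Largest file number that fits into the given number of digits."""
    return 10 ** short_file_digits - 1

print(max_split_files(2), max_split_files(3))  # prints 99 999
```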
The LXForm table is used when an RTF document contains a link to an image instead
of the image itself. Since the link contains a filename which is local to the
machine that the RTF file was produced on, some translation of that filename is
required. The LXForm allows you to specify a series of transformations to be
done to the filename. Each line in the LXForm table is applied in order to the
filename.
The format of the table is:
"tag"[,"regexp","regsub"]
Where tag can be
"LOWER" | translate to lower case
"UPPER" | translate to upper case
"ENCODE" | encode "bad" characters as hex
"DECODE" | decode "bad" characters from hex
"SUBSTITUTE" | perform a regular expression style substitution where "regexp" is a regular expression search pattern and "regsub" is a substitution pattern
For example, the following table:
.LFXform
# translate all special characters to %nn format
"ENCODE"
# strip directory from file name
"SUBSTITUTE",".*%5C",""
# Strip extension if it exists
"SUBSTITUTE","\.[^.]*$",""
# Add a .gif extension
"SUBSTITUTE","$",".gif"
would make the following transformations given the filename "C:\PICTS\CLOWN.WMF"
Transformation | Output
"ENCODE" | C%3A%5CPICTS%5CCLOWN.WMF
"SUBSTITUTE",".*%5C","" | CLOWN.WMF
"SUBSTITUTE","\.[^.]*$","" | CLOWN
"SUBSTITUTE","$",".gif" | CLOWN.gif
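The chain of transformations can be reproduced with Python's standard library, which is a handy way to try out a table before using it. This is only a sketch: it assumes that "ENCODE" behaves like URL percent-encoding, that the regular expressions behave like Python's, and that each substitution is applied once. The function name apply_transforms is invented.

```python
import re
from urllib.parse import quote, unquote

def apply_transforms(filename, table):
    """Apply each (tag, [regexp, regsub]) entry to the filename, in order."""
    for tag, *args in table:
        if tag == "LOWER":
            filename = filename.lower()
        elif tag == "UPPER":
            filename = filename.upper()
        elif tag == "ENCODE":
            filename = quote(filename, safe="")   # ":" -> %3A, "\" -> %5C
        elif tag == "DECODE":
            filename = unquote(filename)
        elif tag == "SUBSTITUTE":
            regexp, regsub = args
            filename = re.sub(regexp, regsub, filename, count=1)
        else:
            raise ValueError("unknown tag: " + tag)
    return filename

TABLE = [
    ("ENCODE",),
    ("SUBSTITUTE", r".*%5C", ""),        # strip directory from file name
    ("SUBSTITUTE", r"\.[^.]*$", ""),     # strip extension if it exists
    ("SUBSTITUTE", r"$", ".gif"),        # add a .gif extension
]
print(apply_transforms("C:\\PICTS\\CLOWN.WMF", TABLE))
```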
If you want the navigation panels
produced by rtftohtml (see section Headings)
to look more spiffy, e.g. with images as panel buttons,
or if you want the generated HTML documents to use images as their background
or another text color, this section is for you.
By using the -N command line option
when invoking rtftohtml, it is possible to tell rtftohtml exactly how you
want the created navigation panels to look. The same configuration file
can be used to add a few funny Netscapisms
to the generated documents. If no -N option was given, but rtftohtml
finds a file named nav-panl in its library directory or the directory
contained in the environment variable RTFLIBDIR, it will use this file
as the layout customization file. This way you can avoid having to add the
-N command line option whenever you use rtftohtml.
An example of such a customization file is the file nav-panl, which has also
been used when this guide was converted to HTML. By looking at this file you
should easily see how the layout of your documents can be adjusted to your
taste.
Each line of such a customization file contains the definition of a layout
element, as long as the first character is not the hash-character (#), which
introduces comments. Everything that follows the first colon (:) in each line
will be literally inserted into the HTML-files when needed.
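The line format just described (element name, first colon, literal HTML) can be read with a short Python sketch like the one below. It is an illustration only: the element names "next" and "title" in the sample are hypothetical and do not come from rtftohtml, and skipping lines without a colon is an assumption.

```python
def parse_layout_file(text):
    """Return {element name: literal HTML} for a layout customization file."""
    elements = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue                      # empty lines and comments
        name, colon, value = line.partition(":")
        if not colon:
            continue                      # assumption: ignore lines without a colon
        # Everything after the FIRST colon is kept literally.
        elements[name.strip()] = value
    return elements

sample = '''\
# hypothetical element names, for illustration only
next:<img src="next.gif" alt="Next">
title:<h1>Guide: part one</h1>
'''
print(parse_layout_file(sample))
```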
The following elements may be configured:
[1] Look, there is one now!
[+] There is my mark.